
Adding new ARC job runner #16653

Draft: wants to merge 23 commits into base: dev

Conversation

@maikenp (Contributor) commented Sep 6, 2023

This feature is a prototype created in the context of ESG.

A new job runner for submitting jobs to a remote ARC middleware endpoint. Behind the ARC endpoint is a regular batch system (Slurm, HTCondor, etc.). ARC endpoints typically sit in front of designated grid infrastructures or HPC systems shared with other users.

The use-case is a user who has access and compute time on one or several HPC systems served by ARC. Instead of the traditional command-line approach to submitting jobs, Galaxy can be used. Authentication is handled by ARC, and the uploading of remote input datasets is also handled by ARC. The user authenticates with ARC in the regular way, via tokens issued by one of the endpoint's supported OIDC token providers. For this to be convenient, the user should log into Galaxy using that same OIDC token provider. Currently a WLCG IAM backend, which is one of the supported IdPs, is in a PR to social_core (python-social-auth/social-core#820). The Galaxy server must be configured to offer this OIDC provider as a login option via oidc_backends_config.yml. See details here: #16617

[screenshot]

Currently, users must configure their user preferences with a single ARC endpoint to use. There is a feature request for the possibility to add as many endpoints as wanted: #16596
The form allowing users to add an ARC endpoint is currently merged into the usegalaxy-eu infrastructure-playbook: usegalaxy-eu/infrastructure-playbook#879. If you want to try this out, have a look at it and apply the small additions to the relevant files.

[screenshot]

Currently, the runner also only works with a specific tool designated for ARC jobs. It has not yet been made ready for use with toolbox tools. The ARC tool config file looks like this:

<tool id="hello_arc" name="ARC hello">
  <description>is a simple hello world python script</description>
  <inputs>
    <param name="message" type="text" value="Hi from galaxy" label="Hello message"/>
    <param name="arcjob_exe" type="data" label="ARC executable"/>
    <param name="arcjob_cpuhrs" type="text" value="1" label="Job CPU hours"/>
    <param name="arcjob_memory" type="text" value="100" label="Job memory"/>
  </inputs>

  <outputs>
    <collection name="output" type="list" label="ARC outputs">
      <discover_datasets pattern="__designation__" directory="./" recurse="true" />
    </collection>
  </outputs>

  <help>
    TODO
  </help>
</tool>

[screenshot]

The tool expects you to upload an executable - in this case, a simple bash script with the following contents:

#!/bin/bash

/bin/echo "Hello from runscript to arcout1.txt" > ./arcout1.txt
/bin/echo "Hello from runscript to arcout2.txt" > ./arcout2.txt

/bin/echo "Running on compute node: $(hostname)" >> ./arcout1.txt
/bin/echo "Time is $(date)" >> ./arcout1.txt

/bin/echo "Running on compute node: $(hostname)" >> ./arcout2.txt
/bin/echo "Time is $(date)" >> ./arcout2.txt

/bin/echo "Sleeping for 60s " >> ./arcout1.txt
/bin/echo "Sleeping for 60s " >> ./arcout2.txt

/bin/echo "Some output to stdout from the runhello.sh executable."
/bin/sleep 60

But the executable could very well be scientific code in C++ or anything else, as long as the ARC endpoint supports that software (this will be handled properly later with so-called ARC runtime environment settings and matchmaking). Currently, this tool allows upload of a single executable and no other input data, just for testing purposes. In the future the job runner will support uploading an arbitrary number of input files from Galaxy, in addition to remote input files (ARC collects these from the remote source).

Once the job is done, all the output files created by the job's executable and by ARC itself are uploaded into Galaxy as expected.

How to test the changes?

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added this to the 23.2 milestone Sep 6, 2023
@maikenp maikenp marked this pull request as draft September 6, 2023 16:28
Require version 0.2 which is compatible with the ARC job runner.
@jmchilton (Member)

I have a couple of refactorings I consider relatively important - can you pull them out of https://github.com/jmchilton/galaxy/tree/arc_job_runner?

This brings it in line with other job runners, makes it more isolated for potential unit testing, simplifies some big overloaded methods in the runner, and would allow a lot of this logic to be reused with a potential Pulsar job runner. I think Pulsar is the way you want to go long term, but I am happy to allow a Galaxy job runner for this as long as you don't get into command-rewriting logic - which there isn't here; you're staging paths as is. That staging could be more targeted with Pulsar, but I'm fine with being wary of premature optimization.

I haven't tested my changes - I think the refactorings are vaguely correct, but I probably introduced some bugs. Hopefully any regressions will be easy to track down, but if I broke something and you want advice, I'm happy to work with you on that.

My next refactoring would be this... almost global ARCJob thing? You're reusing the same object for every job and every request? Shouldn't this somehow be constructed per job? Is there an issue with that? I'd be very anxious about threading and cleanup and such, as things are now.

Also, the job object has a job_runner_external_id attribute that should prevent the need to maintain a huge dictionary that never gets cleaned up (my understanding of arc_jobid = self.arc.job_mapping[job_state.job_id]). Can you use that, or is there something I am missing?

I think addressing these three points would make it easier for me to grok what is going on with the job runner and provide additional feedback.
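
To illustrate the per-job construction suggested above - a minimal, hypothetical sketch (the real ARCJob fields are not shown in this thread, so the dataclass below is illustrative only):

from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-in for pyarcrest's ARCJob; fields are illustrative.
@dataclass
class ARCJobSketch:
    description: str = ""
    input_paths: List[str] = field(default_factory=list)


def build_arc_job(galaxy_job_id: str, input_paths: List[str]) -> ARCJobSketch:
    # One fresh instance per Galaxy job: no shared mutable state between
    # the submission thread and the monitor thread.
    return ARCJobSketch(
        description=f"galaxy job {galaxy_job_id}",
        input_paths=list(input_paths),
    )


job_a = build_arc_job("1", ["input1.txt"])
job_b = build_arc_job("2", ["input2.txt"])
assert job_a is not job_b  # independent objects, safe to mutate per job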



def job_actions(self, arcid, action):

Member:

Also - can you just make this two separate methods with two separate returns? This sort of method dispatching weakens our toolchain's ability to do static checking and, I think, makes the code a bit harder to read.

Contributor Author:

Thanks, will look at it!

log.error(f'Could not get status of job with id: {arcid}')
elif action == "kill":
results = self.arcrest.killJobs([arcid])
for killed in results:
Member:

This will return after the first iteration, so really there's no iteration - is that intentional?
Also, you could just return bool(killed) instead of the if/then.
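
A minimal sketch of the suggested split - two separate methods with two separate return types - assuming a hypothetical pyarcrest-style client (killJobs appears in the diff above; getJobsStatus and the stub class are assumptions for illustration):

from typing import List, Optional


class ArcRestStub:
    # Stand-in for the real pyarcrest client, for illustration only.
    def getJobsStatus(self, arcids: List[str]) -> List[Optional[str]]:
        return ["Running" for _ in arcids]

    def killJobs(self, arcids: List[str]) -> List[bool]:
        return [True for _ in arcids]


class ArcActionsSketch:
    def __init__(self, arcrest: ArcRestStub):
        self.arcrest = arcrest

    def job_status(self, arcid: str) -> Optional[str]:
        # Dedicated method with a single, statically checkable return type.
        statuses = self.arcrest.getJobsStatus([arcid])
        return statuses[0] if statuses else None

    def kill_job(self, arcid: str) -> bool:
        # No loop needed for a single job: inspect the first result and
        # return bool(...) directly instead of an if/then.
        results = self.arcrest.killJobs([arcid])
        return bool(results and results[0])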

Comment on lines 91 to 92
self._init_monitor_thread()
self._init_worker_threads()
Member:

These methods are called automatically on Galaxy startup; the logic goes like this: app.py > JobManager().start() > JobHandler().start() > DefaultJobDispatcher().start() > for each job runner, runner.start(), which is where both methods are called. Unless this is a special case (which I missed)?

Contributor Author:

I did not review/check whether this is necessary; I just followed https://docs.galaxyproject.org/en/master/dev/build_a_job_runner.html
Maybe it is outdated?

Member:

Yeah, that part of the documentation can be interpreted both ways (i.e., this is what the runner does vs. this is the code that should be added). I'd leave it out - it should work: all runners start those threads, which is why it's handled in the base runner.
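
A simplified sketch of that startup chain (hypothetical class names, not the actual Galaxy source) - the base runner's start() already launches both thread pools, so a subclass need not call them again:

class BaseJobRunnerSketch:
    def start(self) -> None:
        # Reached via JobManager().start() > JobHandler().start() >
        # DefaultJobDispatcher().start() > runner.start() for each runner.
        self._init_monitor_thread()
        self._init_worker_threads()

    def _init_monitor_thread(self) -> None:
        print("monitor thread started")

    def _init_worker_threads(self) -> None:
        print("worker threads started")


class ARCJobRunnerSketch(BaseJobRunnerSketch):
    # No explicit _init_monitor_thread()/_init_worker_threads() calls here;
    # the inherited start() handles both.
    pass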

@jdavcs (Member) commented Sep 7, 2023

To get rid of the linting errors, you can just run make format and tox -e mypy before pushing.

@maikenp (Contributor Author) commented Sep 7, 2023

Thanks for looking closely at this, spotting issues, and suggesting improvements. I will implement the changes.

@maikenp (Contributor Author) commented Sep 8, 2023

(quoting @jmchilton's review comment above in full)

Thanks a lot for these suggestions; they improve the code a lot. I have applied the changes and will test now.

About the ARCJob - I might not need it anymore, in fact. It was needed in an earlier version of pyarcrest, but we have since done some refactoring and I think it is of no use anymore, so I will look into that now. And yes, your concern about the per-job ARCJob issue is correct: I noticed the other day that what I had implemented there seemed wrong and needed another look.

…cessary job_actions method. Further improvements to the action on jobs will come.
@maikenp (Contributor Author) commented Sep 8, 2023

Also the job object has a job_runner_external_id attribute that should prevent the need to maintain a huge dictionary that never gets cleaned up (my understanding of arc_jobid = self.arc.job_mapping[job_state.job_id]). Can you use that or is there something I am missing.

@jmchilton

I am rewriting to be able to use the job_runner_external_id. However, in check_watched_item I seem to have difficulties getting the id.

Is
job_state.job_wrapper.get_job().get_job_runner_external_id() supposed to have access to this id?

It returns None for me.

Even though in queue_job I do:

job_wrapper.get_job().job_runner_external_id = arc_jobid
and have verified that at this point
job_wrapper.get_job().get_job_runner_external_id() indeed returns the correct ARC id.

But in check_watched_item, get_job_runner_external_id() returns None.
I also tried just job_runner_external_id, but same result.

And by the way: since there is a getter to get the id, it would be nice to have a setter to set it. But I see the setter is an attribute of the Task object, not the Job object. Any opinions on this?
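
For illustration, one common cause of this symptom with ORM-backed attributes (an assumption here, since the actual cause isn't recorded in this thread) is that the new value is never flushed to the database, so a different session - for example the monitor thread's - still reads the old value. A generic SQLAlchemy sketch, not Galaxy code:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class JobRow(Base):
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    job_runner_external_id = Column(String, nullable=True)


engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

with Session(engine) as setup:
    setup.add(JobRow(id=1))
    setup.commit()

writer = Session(engine)  # e.g. the session used in queue_job
job = writer.get(JobRow, 1)
job.job_runner_external_id = "arc-123"  # set in memory, not yet flushed
print(job.job_runner_external_id)  # arc-123 within this session

reader = Session(engine)  # e.g. the monitor thread's session
print(reader.get(JobRow, 1).job_runner_external_id)  # None: not persisted yet

writer.commit()  # once committed, fresh reads see the new value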

@maikenp (Contributor Author) commented Sep 11, 2023

(quoting the exchange above about job_runner_external_id)

Sorted this out.

@maikenp (Contributor Author) commented Sep 13, 2023

About testing: finally (sorry for the delay and multiple tries) I fixed the whitespace linting errors and other issues, actively using

tox -e lint
tox -e mypy

Now I am trying to sort out the

tox -e check_indexes

error I see here, but running locally all is OK.

<snip>
The Galaxy client build is being skipped due to the SKIP_CLIENT_BUILD environment variable.
check_indexes: commands[1]> bash manage_db.sh init
Activating virtualenv at .venv
INFO:galaxy.model.migrations:Creating database for URI [sqlite:///./database/universe.sqlite?isolation_level=IMMEDIATE]
INFO:alembic.runtime.migration:Context impl SQLiteImpl.
INFO:alembic.runtime.migration:Will assume non-transactional DDL.
INFO:alembic.runtime.migration:Running stamp_revision  -> 987ce9839ecb
DEBUG:alembic.runtime.migration:new branch insert 987ce9839ecb
INFO:alembic.runtime.migration:Context impl SQLiteImpl.
INFO:alembic.runtime.migration:Will assume non-transactional DDL.
INFO:alembic.runtime.migration:Running stamp_revision  -> d4a650f47a3c
DEBUG:alembic.runtime.migration:new branch insert d4a650f47a3c
check_indexes: commands[2]> bash check_model.sh
Activating virtualenv at .venv
/storage/gitrepos/forks/galaxy/./scripts/check_model.py:50: RemovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings.  Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
  metadata = MetaData(bind=create_engine(db_url))
  check_indexes: OK (259.65=setup[0.14]+cmd[242.92,13.78,2.80] seconds)
  congratulations :) (259.71 seconds)

But in the repo tests I see:

Successfully installed httplib2-0.22.0 oauth2client-4.1.3 psycopg2-binary-2.9.5 pyasn1-modules-0.3.0 pykube-0.15.0
Installing node into /home/runner/work/galaxy/galaxy/galaxy root/.venv with nodeenv.
 * Install prebuilt node (18.12.1) .Failed to download from https://nodejs.org/download/release/v18.12.1/node-v18.12.1-linux-x64.tar.gz
..
Traceback (most recent call last):
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/bin/nodeenv", line 10, in <module>
    sys.exit(main())
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 1122, in main
    create_environment(env_dir, args)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 998, in create_environment
    install_node(env_dir, src_dir, args)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 755, in install_node
    install_node_wrapped(env_dir, src_dir, args)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 790, in install_node_wrapped
    copy_node_from_prebuilt(env_dir, src_dir, args.node)
  File "/home/runner/work/galaxy/galaxy/galaxy root/.venv/lib/python3.7/site-packages/nodeenv.py", line 681, in copy_node_from_prebuilt
    src_folder, = glob.glob(src_folder_tpl)
ValueError: not enough values to unpack (expected 1, got 0)
check_indexes: exit 1 (157.23 seconds) /home/runner/work/galaxy/galaxy/galaxy root> bash scripts/common_startup.sh pid=2159
  check_indexes: FAIL code 1 (159.03=setup[1.79]+cmd[157.23] seconds)
  evaluation failed :( (159.43 seconds)
Error: Process completed with exit code 1.

What can I do to fix this error?

@jdavcs (Member) commented Sep 13, 2023

I've restarted those tests - they've just passed, so the previous error was unrelated.

@maikenp maikenp marked this pull request as ready for review September 13, 2023 20:33
@maikenp (Contributor Author) commented Sep 14, 2023

The failed jobs - I'm not sure how I should fix this. Is it related or unrelated to my changes?

@maikenp maikenp marked this pull request as draft September 15, 2023 08:27
@maikenp maikenp mentioned this pull request Sep 27, 2023
4 tasks
@mvdbeek mvdbeek modified the milestones: 23.2, 24.0 Dec 19, 2023